The aim of the analysis is to build a prediction model that can estimate the average cost for 2 people to stay in an AirBnB in Amsterdam for 4 nights. As such, we analysed data about Airbnb listings in the city and identified predictors that are highly relevant to the price and fitted several models. Those predictors are selected first based on common sense and were analysed in terms of their price expanatory power. We then tested several models in order to find out which could predict the price better. The best model was chosen based on how well it helps to predict the prices given different inputs. The price is then estimated based on the best model. The best model includes variables such as property type, number_of_reviews, review_scores_rating, bathrooms, bedrooms, accommodates, host_is_superhost, neighbourhood_simplified, availability_30 and reviews_per_month. These are consistent with common sense metrics that would lead to increases or decreases in price.
Even though there are many variables in the dataframe, here is a quick description of some of the variables collected
price = cost per night
property_type: type of accommodation (House, Apartment, etc.)
room_type:
number_of_reviews: Total number of reviews for the listing
review_scores_rating: Average review score (0 - 100)
longitude , latitude: geographical coordinates to help us locate the listing
neighbourhood*: three variables on a few major neighbourhoods in each city
In the R4DS Exploratory Data Analysis chapter, the authors state:
“Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation… EDA is fundamentally a creative process. And like most creative processes, the key to asking quality questions is to generate a large quantity of questions.”
Conduct a thorough EDA. Recall that an EDA involves three things:
dplyr::glimpse()mosaic::favstats()skimr::skim()ggplot2::ggplot()
geom_histogram() or geom_density() for numeric continuous variablesgeom_bar() or geom_col() for categorical variablesGGally::ggpairs() for scaterrlot/correlation matrix
aes call, for example: aes(colour = gender, alpha = 0.4)You may wish to have a level 1 header (#) for your EDA, then use level 2 sub-headers (##) to make sure you cover all three EDA bases. At a minimum you should address these questions:
At this stage, you may also find you want to use filter, mutate, arrange, select, or count. Let your questions lead you!
In all cases, please think about the message your plot is conveying. Don’t just say “This is my X-axis, this is my Y-axis”, but rather what’s the so what of the plot. Tell some sort of story and speculate about the differences in the patterns in no more than a paragraph.
Once you load the data, it’s always a good idea to use glimpse to see what kind of variables you have and what data type (chr, num, logical, date, etc) they are.
Notice that some of the price data (price) is given as a character string, e.g., “$176.00”
Since price is a quantitative variable, we need to make sure it is stored as numeric data num in the dataframe. To do so, we will first use readr::parse_number() which drops any non-numeric characters before or after the first number
Use typeof(listing$price) to confirm that price is now stored as a number.
glimpse(listings)
Rows: 16,116
Columns: 74
$ id <dbl> 2818, 20168, 25428, 27886…
$ listing_url <chr> "https://www.airbnb.com/r…
$ scrape_id <dbl> 2.021091e+13, 2.021091e+1…
$ last_scraped <date> 2021-09-07, 2021-09-07, …
$ name <chr> "Quiet Garden View Room &…
$ description <chr> "Quiet Garden View Room &…
$ neighborhood_overview <chr> "Indische Buurt (\"Indies…
$ picture_url <chr> "https://a0.muscache.com/…
$ host_id <dbl> 3159, 59484, 56142, 97647…
$ host_url <chr> "https://www.airbnb.com/u…
$ host_name <chr> "Daniel", "Alexander", "J…
$ host_since <date> 2008-09-24, 2009-12-02, …
$ host_location <chr> "Amsterdam, Noord-Holland…
$ host_about <chr> "Upon arriving in Amsterd…
$ host_response_time <chr> "within an hour", "within…
$ host_response_rate <chr> "100%", "100%", "N/A", "8…
$ host_acceptance_rate <chr> "100%", "100%", "0%", "10…
$ host_is_superhost <lgl> TRUE, FALSE, TRUE, TRUE, …
$ host_thumbnail_url <chr> "https://a0.muscache.com/…
$ host_picture_url <chr> "https://a0.muscache.com/…
$ host_neighbourhood <chr> "Indische Buurt", "Gracht…
$ host_listings_count <dbl> 1, 2, 2, 1, 2, 2, 2, 1, 1…
$ host_total_listings_count <dbl> 1, 2, 2, 1, 2, 2, 2, 1, 1…
$ host_verifications <chr> "['email', 'phone', 'revi…
$ host_has_profile_pic <lgl> TRUE, TRUE, TRUE, TRUE, T…
$ host_identity_verified <lgl> TRUE, TRUE, FALSE, TRUE, …
$ neighbourhood <chr> "Amsterdam, North Holland…
$ neighbourhood_cleansed <chr> "Oostelijk Havengebied - …
$ neighbourhood_group_cleansed <lgl> NA, NA, NA, NA, NA, NA, N…
$ latitude <dbl> 52.36435, 52.36407, 52.37…
$ longitude <dbl> 4.94358, 4.89393, 4.88487…
$ property_type <chr> "Private room in rental u…
$ room_type <chr> "Private room", "Private …
$ accommodates <dbl> 2, 2, 3, 2, 2, 1, 2, 3, 2…
$ bathrooms <lgl> NA, NA, NA, NA, NA, NA, N…
$ bathrooms_text <chr> "1.5 shared baths", "1 pr…
$ bedrooms <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ beds <dbl> 2, 1, 1, 1, 1, 1, 1, 1, 2…
$ amenities <chr> "[\"Single level home\", …
$ price <chr> "$59.00", "$106.00", "$12…
$ minimum_nights <dbl> 3, 1, 14, 2, 2, 2, 90, 4,…
$ maximum_nights <dbl> 28, 365, 120, 730, 1825, …
$ minimum_minimum_nights <dbl> 3, 1, 7, 2, 2, 2, 90, 4, …
$ maximum_minimum_nights <dbl> 3, 1, 14, 2, 2, 2, 90, 4,…
$ minimum_maximum_nights <dbl> 1125, 1125, 120, 1125, 18…
$ maximum_maximum_nights <dbl> 1125, 1125, 120, 1125, 18…
$ minimum_nights_avg_ntm <dbl> 3.0, 1.0, 13.8, 2.0, 2.0,…
$ maximum_nights_avg_ntm <dbl> 1125, 1125, 120, 1125, 18…
$ calendar_updated <lgl> NA, NA, NA, NA, NA, NA, N…
$ has_availability <lgl> TRUE, TRUE, TRUE, TRUE, T…
$ availability_30 <dbl> 3, 0, 1, 9, 11, 8, 4, 0, …
$ availability_60 <dbl> 28, 0, 1, 20, 27, 23, 34,…
$ availability_90 <dbl> 55, 0, 3, 47, 50, 50, 64,…
$ availability_365 <dbl> 124, 0, 57, 66, 298, 313,…
$ calendar_last_scraped <date> 2021-09-07, 2021-09-07, …
$ number_of_reviews <dbl> 280, 339, 5, 223, 353, 49…
$ number_of_reviews_ltm <dbl> 2, 0, 0, 4, 19, 17, 0, 0,…
$ number_of_reviews_l30d <dbl> 0, 0, 0, 2, 8, 10, 0, 0, …
$ first_review <date> 2013-08-25, 2014-01-17, …
$ last_review <date> 2019-11-21, 2020-03-27, …
$ review_scores_rating <dbl> 4.89, 4.44, 5.00, 4.95, 4…
$ review_scores_accuracy <dbl> 4.93, 4.69, 5.00, 4.93, 4…
$ review_scores_cleanliness <dbl> 5.00, 4.79, 5.00, 4.96, 4…
$ review_scores_checkin <dbl> 4.97, 4.63, 5.00, 4.95, 4…
$ review_scores_communication <dbl> 4.97, 4.62, 5.00, 4.92, 4…
$ review_scores_location <dbl> 4.68, 4.87, 5.00, 4.90, 4…
$ review_scores_value <dbl> 4.81, 4.49, 4.80, 4.80, 4…
$ license <chr> "0363 5F3A 5684 6750 D14D…
$ instant_bookable <lgl> TRUE, TRUE, FALSE, TRUE, …
$ calculated_host_listings_count <dbl> 1, 2, 1, 1, 2, 2, 1, 1, 2…
$ calculated_host_listings_count_entire_homes <dbl> 0, 0, 1, 0, 0, 0, 1, 1, 2…
$ calculated_host_listings_count_private_rooms <dbl> 1, 2, 0, 1, 2, 2, 0, 0, 0…
$ calculated_host_listings_count_shared_rooms <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ reviews_per_month <dbl> 2.86, 3.64, 0.11, 2.14, 4…
favstats(listings$id)
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 2.82e+03 | 1.02e+07 | 1.93e+07 | 3.11e+07 | 5.21e+07 | 2.12e+07 | 1.35e+07 | 16116 | 0 |
skim(listings)
| Name | listings |
| Number of rows | 16116 |
| Number of columns | 74 |
| _______________________ | |
| Column type frequency: | |
| character | 24 |
| Date | 5 |
| logical | 8 |
| numeric | 37 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| listing_url | 0 | 1.00 | 33 | 37 | 0 | 16116 | 0 |
| name | 30 | 1.00 | 1 | 240 | 0 | 15766 | 0 |
| description | 223 | 0.99 | 1 | 1000 | 0 | 15711 | 0 |
| neighborhood_overview | 5711 | 0.65 | 1 | 1000 | 0 | 9495 | 0 |
| picture_url | 0 | 1.00 | 61 | 126 | 0 | 15968 | 0 |
| host_url | 0 | 1.00 | 38 | 43 | 0 | 14197 | 0 |
| host_name | 5 | 1.00 | 1 | 33 | 0 | 5166 | 0 |
| host_location | 35 | 1.00 | 2 | 86 | 0 | 560 | 0 |
| host_about | 7066 | 0.56 | 1 | 8920 | 0 | 7522 | 16 |
| host_response_time | 5 | 1.00 | 3 | 18 | 0 | 5 | 0 |
| host_response_rate | 5 | 1.00 | 2 | 4 | 0 | 57 | 0 |
| host_acceptance_rate | 5 | 1.00 | 2 | 4 | 0 | 91 | 0 |
| host_thumbnail_url | 5 | 1.00 | 55 | 106 | 0 | 14161 | 0 |
| host_picture_url | 5 | 1.00 | 57 | 109 | 0 | 14161 | 0 |
| host_neighbourhood | 5713 | 0.65 | 4 | 35 | 0 | 68 | 0 |
| host_verifications | 0 | 1.00 | 2 | 158 | 0 | 361 | 0 |
| neighbourhood | 5711 | 0.65 | 11 | 59 | 0 | 67 | 0 |
| neighbourhood_cleansed | 0 | 1.00 | 4 | 38 | 0 | 22 | 0 |
| property_type | 0 | 1.00 | 3 | 35 | 0 | 67 | 0 |
| room_type | 0 | 1.00 | 10 | 15 | 0 | 4 | 0 |
| bathrooms_text | 21 | 1.00 | 6 | 17 | 0 | 26 | 0 |
| amenities | 0 | 1.00 | 2 | 1488 | 0 | 14667 | 0 |
| price | 0 | 1.00 | 5 | 9 | 0 | 498 | 0 |
| license | 11561 | 0.28 | 6 | 63 | 0 | 3624 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| last_scraped | 0 | 1.00 | 2021-09-07 | 2021-09-08 | 2021-09-07 | 2 |
| host_since | 5 | 1.00 | 2008-09-24 | 2021-08-31 | 2015-04-07 | 3235 |
| calendar_last_scraped | 0 | 1.00 | 2021-09-07 | 2021-09-08 | 2021-09-07 | 2 |
| first_review | 2087 | 0.87 | 2011-04-19 | 2021-09-07 | 2017-10-30 | 2467 |
| last_review | 2087 | 0.87 | 2011-06-12 | 2021-09-07 | 2019-06-10 | 2019 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| host_is_superhost | 5 | 1 | 0.12 | FAL: 14103, TRU: 2008 |
| host_has_profile_pic | 5 | 1 | 1.00 | TRU: 16078, FAL: 33 |
| host_identity_verified | 5 | 1 | 0.67 | TRU: 10720, FAL: 5391 |
| neighbourhood_group_cleansed | 16116 | 0 | NaN | : |
| bathrooms | 16116 | 0 | NaN | : |
| calendar_updated | 16116 | 0 | NaN | : |
| has_availability | 0 | 1 | 0.96 | TRU: 15457, FAL: 659 |
| instant_bookable | 0 | 1 | 0.23 | FAL: 12408, TRU: 3708 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 2.118184e+07 | 13520629.26 | 2.818000e+03 | 1.018020e+07 | 1.926593e+07 | 3.107548e+07 | 5.208280e+07 | ▇▇▆▅▂ |
| scrape_id | 0 | 1.00 | 2.021091e+13 | 0.00 | 2.021091e+13 | 2.021091e+13 | 2.021091e+13 | 2.021091e+13 | 2.021091e+13 | ▁▁▇▁▁ |
| host_id | 0 | 1.00 | 6.976052e+07 | 90271635.64 | 3.159000e+03 | 9.735558e+06 | 2.974134e+07 | 8.988321e+07 | 4.210037e+08 | ▇▂▁▁▁ |
| host_listings_count | 5 | 1.00 | 2.020000e+00 | 23.34 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.992000e+03 | ▇▁▁▁▁ |
| host_total_listings_count | 5 | 1.00 | 2.020000e+00 | 23.34 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.992000e+03 | ▇▁▁▁▁ |
| latitude | 0 | 1.00 | 5.237000e+01 | 0.02 | 5.229000e+01 | 5.236000e+01 | 5.236000e+01 | 5.238000e+01 | 5.243000e+01 | ▁▁▇▅▁ |
| longitude | 0 | 1.00 | 4.890000e+00 | 0.04 | 4.760000e+00 | 4.860000e+00 | 4.890000e+00 | 4.910000e+00 | 5.070000e+00 | ▁▆▇▁▁ |
| accommodates | 0 | 1.00 | 2.840000e+00 | 1.31 | 0.000000e+00 | 2.000000e+00 | 2.000000e+00 | 4.000000e+00 | 1.600000e+01 | ▇▃▁▁▁ |
| bedrooms | 898 | 0.94 | 1.530000e+00 | 0.95 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 5.000000e+01 | ▇▁▁▁▁ |
| beds | 97 | 0.99 | 1.760000e+00 | 1.47 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 3.300000e+01 | ▇▁▁▁▁ |
| minimum_nights | 0 | 1.00 | 3.990000e+00 | 20.99 | 1.000000e+00 | 2.000000e+00 | 2.000000e+00 | 3.000000e+00 | 1.100000e+03 | ▇▁▁▁▁ |
| maximum_nights | 0 | 1.00 | 6.081500e+02 | 540.67 | 1.000000e+00 | 2.100000e+01 | 1.125000e+03 | 1.125000e+03 | 1.825000e+03 | ▇▁▁▇▁ |
| minimum_minimum_nights | 3 | 1.00 | 3.950000e+00 | 20.99 | 1.000000e+00 | 2.000000e+00 | 2.000000e+00 | 3.000000e+00 | 1.100000e+03 | ▇▁▁▁▁ |
| maximum_minimum_nights | 3 | 1.00 | 4.150000e+00 | 21.03 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 3.000000e+00 | 1.100000e+03 | ▇▁▁▁▁ |
| minimum_maximum_nights | 3 | 1.00 | 6.855200e+02 | 532.48 | 1.000000e+00 | 2.800000e+01 | 1.125000e+03 | 1.125000e+03 | 1.825000e+03 | ▆▁▁▇▁ |
| maximum_maximum_nights | 3 | 1.00 | 2.672465e+05 | 23924509.45 | 1.000000e+00 | 2.800000e+01 | 1.125000e+03 | 1.125000e+03 | 2.147484e+09 | ▇▁▁▁▁ |
| minimum_nights_avg_ntm | 3 | 1.00 | 4.030000e+00 | 21.00 | 1.000000e+00 | 2.000000e+00 | 2.000000e+00 | 3.000000e+00 | 1.100000e+03 | ▇▁▁▁▁ |
| maximum_nights_avg_ntm | 3 | 1.00 | 2.666400e+05 | 23870381.65 | 1.000000e+00 | 2.800000e+01 | 1.125000e+03 | 1.125000e+03 | 2.142625e+09 | ▇▁▁▁▁ |
| availability_30 | 0 | 1.00 | 4.030000e+00 | 8.70 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.000000e+01 | ▇▁▁▁▁ |
| availability_60 | 0 | 1.00 | 8.840000e+00 | 18.14 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 2.000000e+00 | 6.000000e+01 | ▇▁▁▁▁ |
| availability_90 | 0 | 1.00 | 1.432000e+01 | 28.51 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 5.000000e+00 | 9.000000e+01 | ▇▁▁▁▁ |
| availability_365 | 0 | 1.00 | 5.532000e+01 | 107.91 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 4.700000e+01 | 3.650000e+02 | ▇▁▁▁▁ |
| number_of_reviews | 0 | 1.00 | 2.465000e+01 | 56.71 | 0.000000e+00 | 2.000000e+00 | 8.000000e+00 | 2.200000e+01 | 8.770000e+02 | ▇▁▁▁▁ |
| number_of_reviews_ltm | 0 | 1.00 | 1.390000e+00 | 7.48 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 4.220000e+02 | ▇▁▁▁▁ |
| number_of_reviews_l30d | 0 | 1.00 | 3.300000e-01 | 2.03 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.370000e+02 | ▇▁▁▁▁ |
| review_scores_rating | 2087 | 0.87 | 4.690000e+00 | 0.67 | 0.000000e+00 | 4.670000e+00 | 4.860000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_accuracy | 2301 | 0.86 | 4.810000e+00 | 0.35 | 0.000000e+00 | 4.750000e+00 | 4.910000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_cleanliness | 2300 | 0.86 | 4.700000e+00 | 0.44 | 0.000000e+00 | 4.600000e+00 | 4.830000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_checkin | 2309 | 0.86 | 4.850000e+00 | 0.31 | 0.000000e+00 | 4.820000e+00 | 4.950000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_communication | 2304 | 0.86 | 4.870000e+00 | 0.30 | 0.000000e+00 | 4.850000e+00 | 4.980000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_location | 2309 | 0.86 | 4.730000e+00 | 0.33 | 0.000000e+00 | 4.600000e+00 | 4.810000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_value | 2309 | 0.86 | 4.600000e+00 | 0.39 | 0.000000e+00 | 4.500000e+00 | 4.670000e+00 | 4.830000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| calculated_host_listings_count | 0 | 1.00 | 1.640000e+00 | 2.40 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 3.000000e+01 | ▇▁▁▁▁ |
| calculated_host_listings_count_entire_homes | 0 | 1.00 | 1.100000e+00 | 1.87 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 3.000000e+01 | ▇▁▁▁▁ |
| calculated_host_listings_count_private_rooms | 0 | 1.00 | 4.900000e-01 | 1.53 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 2.100000e+01 | ▇▁▁▁▁ |
| calculated_host_listings_count_shared_rooms | 0 | 1.00 | 1.000000e-02 | 0.13 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 4.000000e+00 | ▇▁▁▁▁ |
| reviews_per_month | 2087 | 0.87 | 6.800000e-01 | 1.72 | 1.000000e-02 | 1.100000e-01 | 2.700000e-01 | 6.200000e-01 | 1.078400e+02 | ▇▁▁▁▁ |
#Storing the price as a number
listings_clean <- listings %>%
filter(name != "",
host_location != "") %>%
mutate(price = parse_number(price),
bathrooms = parse_number(gsub( "\\s.*", "", bathrooms_text))) %>%
filter(bathrooms != "")
typeof(listings_clean$price)
[1] "double"
skim(listings_clean)
| Name | listings_clean |
| Number of rows | 15957 |
| Number of columns | 74 |
| _______________________ | |
| Column type frequency: | |
| character | 23 |
| Date | 5 |
| logical | 7 |
| numeric | 39 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| listing_url | 0 | 1.00 | 33 | 37 | 0 | 15957 | 0 |
| name | 0 | 1.00 | 1 | 240 | 0 | 15647 | 0 |
| description | 205 | 0.99 | 1 | 1000 | 0 | 15578 | 0 |
| neighborhood_overview | 5628 | 0.65 | 1 | 1000 | 0 | 9437 | 0 |
| picture_url | 0 | 1.00 | 61 | 126 | 0 | 15809 | 0 |
| host_url | 0 | 1.00 | 38 | 43 | 0 | 14073 | 0 |
| host_name | 0 | 1.00 | 1 | 33 | 0 | 5133 | 0 |
| host_location | 0 | 1.00 | 2 | 86 | 0 | 555 | 0 |
| host_about | 6954 | 0.56 | 1 | 8920 | 0 | 7489 | 16 |
| host_response_time | 0 | 1.00 | 3 | 18 | 0 | 5 | 0 |
| host_response_rate | 0 | 1.00 | 2 | 4 | 0 | 57 | 0 |
| host_acceptance_rate | 0 | 1.00 | 2 | 4 | 0 | 91 | 0 |
| host_thumbnail_url | 0 | 1.00 | 55 | 106 | 0 | 14046 | 0 |
| host_picture_url | 0 | 1.00 | 57 | 109 | 0 | 14046 | 0 |
| host_neighbourhood | 5655 | 0.65 | 4 | 35 | 0 | 68 | 0 |
| host_verifications | 0 | 1.00 | 2 | 158 | 0 | 360 | 0 |
| neighbourhood | 5628 | 0.65 | 11 | 59 | 0 | 66 | 0 |
| neighbourhood_cleansed | 0 | 1.00 | 4 | 38 | 0 | 22 | 0 |
| property_type | 0 | 1.00 | 3 | 35 | 0 | 66 | 0 |
| room_type | 0 | 1.00 | 10 | 15 | 0 | 4 | 0 |
| bathrooms_text | 0 | 1.00 | 6 | 16 | 0 | 23 | 0 |
| amenities | 0 | 1.00 | 2 | 1488 | 0 | 14532 | 0 |
| license | 11441 | 0.28 | 6 | 63 | 0 | 3609 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| last_scraped | 0 | 1.00 | 2021-09-07 | 2021-09-08 | 2021-09-07 | 2 |
| host_since | 0 | 1.00 | 2008-09-24 | 2021-08-31 | 2015-04-02 | 3231 |
| calendar_last_scraped | 0 | 1.00 | 2021-09-07 | 2021-09-08 | 2021-09-07 | 2 |
| first_review | 2050 | 0.87 | 2011-04-19 | 2021-09-07 | 2017-11-01 | 2462 |
| last_review | 2050 | 0.87 | 2011-06-12 | 2021-09-07 | 2019-06-10 | 2015 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| host_is_superhost | 0 | 1 | 0.12 | FAL: 13969, TRU: 1988 |
| host_has_profile_pic | 0 | 1 | 1.00 | TRU: 15928, FAL: 29 |
| host_identity_verified | 0 | 1 | 0.67 | TRU: 10620, FAL: 5337 |
| neighbourhood_group_cleansed | 15957 | 0 | NaN | : |
| calendar_updated | 15957 | 0 | NaN | : |
| has_availability | 0 | 1 | 0.96 | TRU: 15315, FAL: 642 |
| instant_bookable | 0 | 1 | 0.23 | FAL: 12281, TRU: 3676 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 2.118191e+07 | 13504046.15 | 2.818000e+03 | 1.019025e+07 | 1.930569e+07 | 3.106922e+07 | 5.208280e+07 | ▇▇▆▅▂ |
| scrape_id | 0 | 1.00 | 2.021091e+13 | 0.00 | 2.021091e+13 | 2.021091e+13 | 2.021091e+13 | 2.021091e+13 | 2.021091e+13 | ▁▁▇▁▁ |
| host_id | 0 | 1.00 | 6.963628e+07 | 90162614.81 | 3.159000e+03 | 9.667780e+06 | 2.956976e+07 | 9.006509e+07 | 4.210037e+08 | ▇▂▁▁▁ |
| host_listings_count | 0 | 1.00 | 2.020000e+00 | 23.45 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.992000e+03 | ▇▁▁▁▁ |
| host_total_listings_count | 0 | 1.00 | 2.020000e+00 | 23.45 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.992000e+03 | ▇▁▁▁▁ |
| latitude | 0 | 1.00 | 5.237000e+01 | 0.02 | 5.229000e+01 | 5.236000e+01 | 5.236000e+01 | 5.238000e+01 | 5.243000e+01 | ▁▁▇▅▁ |
| longitude | 0 | 1.00 | 4.890000e+00 | 0.04 | 4.760000e+00 | 4.860000e+00 | 4.890000e+00 | 4.910000e+00 | 5.070000e+00 | ▁▆▇▁▁ |
| accommodates | 0 | 1.00 | 2.840000e+00 | 1.31 | 1.000000e+00 | 2.000000e+00 | 2.000000e+00 | 4.000000e+00 | 1.600000e+01 | ▇▁▁▁▁ |
| bathrooms | 0 | 1.00 | 1.200000e+00 | 0.41 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.500000e+00 | 1.300000e+01 | ▇▁▁▁▁ |
| bedrooms | 866 | 0.95 | 1.530000e+00 | 0.95 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 5.000000e+01 | ▇▁▁▁▁ |
| beds | 80 | 0.99 | 1.760000e+00 | 1.47 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 3.300000e+01 | ▇▁▁▁▁ |
| price | 0 | 1.00 | 1.573700e+02 | 173.87 | 4.000000e+00 | 9.800000e+01 | 1.300000e+02 | 1.800000e+02 | 8.000000e+03 | ▇▁▁▁▁ |
| minimum_nights | 0 | 1.00 | 3.970000e+00 | 21.01 | 1.000000e+00 | 2.000000e+00 | 2.000000e+00 | 3.000000e+00 | 1.100000e+03 | ▇▁▁▁▁ |
| maximum_nights | 0 | 1.00 | 6.077800e+02 | 540.94 | 1.000000e+00 | 2.100000e+01 | 1.125000e+03 | 1.125000e+03 | 1.825000e+03 | ▇▁▁▇▁ |
| minimum_minimum_nights | 0 | 1.00 | 3.930000e+00 | 21.01 | 1.000000e+00 | 2.000000e+00 | 2.000000e+00 | 3.000000e+00 | 1.100000e+03 | ▇▁▁▁▁ |
| maximum_minimum_nights | 0 | 1.00 | 4.130000e+00 | 21.06 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 3.000000e+00 | 1.100000e+03 | ▇▁▁▁▁ |
| minimum_maximum_nights | 0 | 1.00 | 6.849200e+02 | 532.67 | 1.000000e+00 | 2.700000e+01 | 1.125000e+03 | 1.125000e+03 | 1.825000e+03 | ▆▁▁▇▁ |
| maximum_maximum_nights | 0 | 1.00 | 2.698519e+05 | 24041164.00 | 1.000000e+00 | 2.800000e+01 | 1.125000e+03 | 1.125000e+03 | 2.147484e+09 | ▇▁▁▁▁ |
| minimum_nights_avg_ntm | 0 | 1.00 | 4.000000e+00 | 21.02 | 1.000000e+00 | 2.000000e+00 | 2.000000e+00 | 3.000000e+00 | 1.100000e+03 | ▇▁▁▁▁ |
| maximum_nights_avg_ntm | 0 | 1.00 | 2.692394e+05 | 23986772.28 | 1.000000e+00 | 2.800000e+01 | 1.125000e+03 | 1.125000e+03 | 2.142625e+09 | ▇▁▁▁▁ |
| availability_30 | 0 | 1.00 | 4.010000e+00 | 8.68 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.000000e+01 | ▇▁▁▁▁ |
| availability_60 | 0 | 1.00 | 8.810000e+00 | 18.10 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 2.000000e+00 | 6.000000e+01 | ▇▁▁▁▁ |
| availability_90 | 0 | 1.00 | 1.427000e+01 | 28.45 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 5.000000e+00 | 9.000000e+01 | ▇▁▁▁▁ |
| availability_365 | 0 | 1.00 | 5.506000e+01 | 107.55 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 4.600000e+01 | 3.650000e+02 | ▇▁▁▁▁ |
| number_of_reviews | 0 | 1.00 | 2.462000e+01 | 56.74 | 0.000000e+00 | 2.000000e+00 | 8.000000e+00 | 2.200000e+01 | 8.770000e+02 | ▇▁▁▁▁ |
| number_of_reviews_ltm | 0 | 1.00 | 1.380000e+00 | 7.47 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 4.220000e+02 | ▇▁▁▁▁ |
| number_of_reviews_l30d | 0 | 1.00 | 3.200000e-01 | 2.01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.370000e+02 | ▇▁▁▁▁ |
| review_scores_rating | 2050 | 0.87 | 4.690000e+00 | 0.67 | 0.000000e+00 | 4.670000e+00 | 4.860000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_accuracy | 2261 | 0.86 | 4.810000e+00 | 0.35 | 0.000000e+00 | 4.750000e+00 | 4.910000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_cleanliness | 2260 | 0.86 | 4.700000e+00 | 0.44 | 0.000000e+00 | 4.600000e+00 | 4.830000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_checkin | 2269 | 0.86 | 4.850000e+00 | 0.31 | 0.000000e+00 | 4.820000e+00 | 4.950000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_communication | 2264 | 0.86 | 4.870000e+00 | 0.30 | 0.000000e+00 | 4.850000e+00 | 4.980000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_location | 2269 | 0.86 | 4.730000e+00 | 0.33 | 0.000000e+00 | 4.600000e+00 | 4.810000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_value | 2269 | 0.86 | 4.600000e+00 | 0.39 | 0.000000e+00 | 4.500000e+00 | 4.670000e+00 | 4.830000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| calculated_host_listings_count | 0 | 1.00 | 1.630000e+00 | 2.37 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 3.000000e+01 | ▇▁▁▁▁ |
| calculated_host_listings_count_entire_homes | 0 | 1.00 | 1.100000e+00 | 1.88 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 3.000000e+01 | ▇▁▁▁▁ |
| calculated_host_listings_count_private_rooms | 0 | 1.00 | 4.700000e-01 | 1.46 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 2.100000e+01 | ▇▁▁▁▁ |
| calculated_host_listings_count_shared_rooms | 0 | 1.00 | 1.000000e-02 | 0.14 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 4.000000e+00 | ▇▁▁▁▁ |
| reviews_per_month | 2050 | 0.87 | 6.800000e-01 | 1.72 | 1.000000e-02 | 1.100000e-01 | 2.700000e-01 | 6.200000e-01 | 1.078400e+02 | ▇▁▁▁▁ |
#explore the distribution of bedrooms and beds
ggplot(listings_clean) +
geom_density(aes(x=bedrooms))+
theme_bw()
ggplot(listings_clean) +
geom_density(aes(x=beds))+
theme_bw()
ggplot(listings_clean) +
geom_density(aes(x=price))+
theme_bw()
> We noticed that for each variable there are a lot of outliers outside where the majority data concentrate and we would like to remove those uncommon observations later.
# assumptions of the followings variables are made in accordance to common choices for tourists
listings_clean1 <- listings_clean %>% filter(bedrooms < 5)
listings_clean3 <- listings_clean1 %>% filter (minimum_nights <= 4)
listings_clean4 <- listings_clean3 %>% filter (price < 1500) # Everything above that would be extraordinarily expensive
listings_clean5 <- listings_clean4 %>% filter (accommodates > 1)
# cleaning the data and choosing the relevant variables
listings_smaller <- listings_clean5 %>%
select(host_location,
neighbourhood,
neighbourhood_cleansed,
neighbourhood_group_cleansed,
property_type,
room_type,
accommodates,
bathrooms,
bathrooms_text,
bedrooms,
beds,
price,
review_scores_rating,
minimum_nights,
maximum_nights,
property_type,
latitude,
longitude,
host_is_superhost,
instant_bookable,
availability_30,
reviews_per_month)
# investigate the correlation among some key variables
listings_smaller %>%
select(price, bedrooms, beds, bathrooms, review_scores_rating) %>%
ggpairs(aes(alpha = 0.4))
#box plot for number of beds (1 and 2) relative to prices
ggplot(listings_smaller, aes(x = price)) +
geom_boxplot() +
facet_wrap(~bedrooms)
#comparison of review scores and prices
ggplot(listings_smaller, aes(x = review_scores_rating, y = price)) +
geom_point(aes(alpha = 0.4))
Plot 1 showcases that 2 bedrooms are relatively more (but only slightly) expensive than 1 bedroom in Amsterdam; however the range and variability of prices of 2 bedrooom rooms appears to be greater than a 1 bedroom, with the 75% percentile being significantly higher for 2 bedrooms compared to 1. The moderately strong positive correlation of 0.378 between bedrooms and price also indicates the same. Similarly, the plots also indicate that Higher ratings is also associated with higher prices, even though the correlation between the two variables is only weakly positive.
0 review ratings do not tell us much about the quality of the accomodation nor the price. It can be explained by that new listing usually don’t have reviews and 0 ratings are therefore not informative
Next, we look at the variable property_type. We can use the count function to determine how many categories there are their frequency. What are the top 4 most common property types? What proportion of the total listings do they make up?
Since the vast majority of the observations in the data are one of the top four or five property types, we would like to create a simplified version of property_type variable that has 5 categories: the top four categories and Other. Fill in the code below to create prop_type_simplified.
# property type
listings_clean5 %>%
group_by(property_type) %>%
summarise(count = n()) %>%
arrange(desc(count))
| property_type | count |
|---|---|
| Entire rental unit | 7806 |
| Private room in rental unit | 1406 |
| Entire residential home | 794 |
| Entire townhouse | 325 |
| Private room in bed and breakfast | 270 |
| Entire condominium (condo) | 261 |
| Private room in residential home | 234 |
| Entire loft | 199 |
| Houseboat | 157 |
| Boat | 126 |
| Room in boutique hotel | 125 |
| Private room in townhouse | 122 |
| Private room in houseboat | 95 |
| Private room in guest suite | 91 |
| Room in hotel | 83 |
| Private room in boat | 72 |
| Entire serviced apartment | 69 |
| Private room in condominium (condo) | 67 |
| Private room in loft | 44 |
| Room in bed and breakfast | 31 |
| Entire guest suite | 25 |
| Private room in hostel | 21 |
| Entire villa | 17 |
| Room in serviced apartment | 16 |
| Shared room in rental unit | 16 |
| Entire guesthouse | 12 |
| Private room | 12 |
| Room in aparthotel | 12 |
| Private room in guesthouse | 11 |
| Private room in serviced apartment | 10 |
| Private room in farm stay | 7 |
| Private room in villa | 6 |
| Entire cabin | 5 |
| Entire cottage | 5 |
| Entire place | 5 |
| Room in hostel | 5 |
| Tiny house | 5 |
| Barn | 4 |
| Entire bed and breakfast | 4 |
| Entire bungalow | 4 |
| Private room in casa particular | 4 |
| Private room in tiny house | 4 |
| Entire home/apt | 3 |
| Shared room in houseboat | 3 |
| Camper/RV | 2 |
| Entire chalet | 2 |
| Floor | 2 |
| Private room in cabin | 2 |
| Private room in dome house | 2 |
| Private room in floor | 2 |
| Shared room in bed and breakfast | 2 |
| Shared room in residential home | 2 |
| Tower | 2 |
| Campsite | 1 |
| Earth house | 1 |
| Private room in bungalow | 1 |
| Private room in earth house | 1 |
| Private room in island | 1 |
| Private room in nature lodge | 1 |
| Shared room in boat | 1 |
| Shared room in hostel | 1 |
#creating a simplified version of `property_type`
listings_clean5 <- listings_clean5 %>%
mutate(prop_type_simplified = case_when(
property_type %in% c("Entire rental unit","Entire residential home", "Entire townhouse","Private room in rental unit") ~ property_type,
TRUE ~ "Other"
))
As shown in the table above, the four most common property types in Airbnb Amsterdam are “Entire rental unit”, “private room in rental unit”, “entire condo” and “entire residential home”. Collectively they account for the majority of the property types available in Amsterdam. The significantly high number of row houses, condos and houses with lofts relative to other major cities like Rotterdam (more metropolitan with numerous skyscapers) also justifies the relatively high number of condos and lofts.
Use the code below to check that prop_type_simplified was correctly made.
listings_clean5 %>%
count(property_type, prop_type_simplified) %>%
arrange(desc(n))
| property_type | prop_type_simplified | n |
|---|---|---|
| Entire rental unit | Entire rental unit | 7806 |
| Private room in rental unit | Private room in rental unit | 1406 |
| Entire residential home | Entire residential home | 794 |
| Entire townhouse | Entire townhouse | 325 |
| Private room in bed and breakfast | Other | 270 |
| Entire condominium (condo) | Other | 261 |
| Private room in residential home | Other | 234 |
| Entire loft | Other | 199 |
| Houseboat | Other | 157 |
| Boat | Other | 126 |
| Room in boutique hotel | Other | 125 |
| Private room in townhouse | Other | 122 |
| Private room in houseboat | Other | 95 |
| Private room in guest suite | Other | 91 |
| Room in hotel | Other | 83 |
| Private room in boat | Other | 72 |
| Entire serviced apartment | Other | 69 |
| Private room in condominium (condo) | Other | 67 |
| Private room in loft | Other | 44 |
| Room in bed and breakfast | Other | 31 |
| Entire guest suite | Other | 25 |
| Private room in hostel | Other | 21 |
| Entire villa | Other | 17 |
| Room in serviced apartment | Other | 16 |
| Shared room in rental unit | Other | 16 |
| Entire guesthouse | Other | 12 |
| Private room | Other | 12 |
| Room in aparthotel | Other | 12 |
| Private room in guesthouse | Other | 11 |
| Private room in serviced apartment | Other | 10 |
| Private room in farm stay | Other | 7 |
| Private room in villa | Other | 6 |
| Entire cabin | Other | 5 |
| Entire cottage | Other | 5 |
| Entire place | Other | 5 |
| Room in hostel | Other | 5 |
| Tiny house | Other | 5 |
| Barn | Other | 4 |
| Entire bed and breakfast | Other | 4 |
| Entire bungalow | Other | 4 |
| Private room in casa particular | Other | 4 |
| Private room in tiny house | Other | 4 |
| Entire home/apt | Other | 3 |
| Shared room in houseboat | Other | 3 |
| Camper/RV | Other | 2 |
| Entire chalet | Other | 2 |
| Floor | Other | 2 |
| Private room in cabin | Other | 2 |
| Private room in dome house | Other | 2 |
| Private room in floor | Other | 2 |
| Shared room in bed and breakfast | Other | 2 |
| Shared room in residential home | Other | 2 |
| Tower | Other | 2 |
| Campsite | Other | 1 |
| Earth house | Other | 1 |
| Private room in bungalow | Other | 1 |
| Private room in earth house | Other | 1 |
| Private room in island | Other | 1 |
| Private room in nature lodge | Other | 1 |
| Shared room in boat | Other | 1 |
| Shared room in hostel | Other | 1 |
Airbnb is most commonly used for travel purposes, i.e., as an alternative to traditional hotels. We only want to include listings in our regression analysis that are intended for travel purposes:
minimum_nights?minimum_nights?listings %>%
count(minimum_nights) %>%
arrange(desc(n))
| minimum_nights | n |
|---|---|
| 2 | 5993 |
| 3 | 3918 |
| 1 | 2658 |
| 4 | 1367 |
| 5 | 895 |
| 7 | 481 |
| 6 | 234 |
| 14 | 109 |
| 10 | 92 |
| 30 | 63 |
| 8 | 33 |
| 21 | 32 |
| 20 | 30 |
| 25 | 19 |
| 12 | 18 |
| 15 | 18 |
| 28 | 16 |
| 60 | 15 |
| 9 | 12 |
| 90 | 11 |
| 13 | 7 |
| 59 | 7 |
| 180 | 7 |
| 11 | 4 |
| 19 | 4 |
| 23 | 4 |
| 29 | 4 |
| 50 | 4 |
| 58 | 4 |
| 100 | 4 |
| 16 | 3 |
| 27 | 3 |
| 31 | 3 |
| 200 | 3 |
| 300 | 3 |
| 365 | 3 |
| 18 | 2 |
| 26 | 2 |
| 42 | 2 |
| 45 | 2 |
| 99 | 2 |
| 150 | 2 |
| 1e+03 | 2 |
| 24 | 1 |
| 33 | 1 |
| 37 | 1 |
| 38 | 1 |
| 63 | 1 |
| 70 | 1 |
| 75 | 1 |
| 80 | 1 |
| 89 | 1 |
| 94 | 1 |
| 95 | 1 |
| 120 | 1 |
| 181 | 1 |
| 183 | 1 |
| 186 | 1 |
| 222 | 1 |
| 240 | 1 |
| 500 | 1 |
| 999 | 1 |
| 1e+03 | 1 |
| 1.1e+03 | 1 |
# Filtered out everything above 4 min. nights in listings_clean
ggplot(listings_clean5, aes(x=minimum_nights)) +
geom_histogram()
The most common values appearing for the variable “minimum nights” are 1,2 and 3 nights. Out of these 3 values, 2 nights is by far the most commonly occuring value. Given the fact that Amsterdam is amongst one of the top cities for a weekend visit by tourists residing in UK and western Europe, a minimum of 2 nights is seems logical. Similarly, 30 nights is the 9th highest with it being only 1 of 2 values to be greater than 10. The fact that any reservation >29 days is considered as a “long-term stay” and the host can earn significant benefits from the same could explain why a minimum of 30 nights is so common.
Filter the airbnb data so that it only includes observations with minimum_nights <= 4
Visualisations of feature distributions and their relations are key to understanding a data set, and they can open up new lines of exploration. While we do not have time to go into all the wonderful geospatial visualisations one can do with R, you can use the following code to start with a map of your city, and overlay all AirBnB coordinates to get an overview of the spatial distribution of AirBnB rentals. For this visualisation we use the leaflet package, which includes a variety of tools for interactive maps, so you can easily zoom in-out, click on a point to get the actual AirBnB listing for that specific point, etc.
The following code, having downloaded a dataframe listings with all AirbnB listings in Milan, will plot on the map all AirBnBs where minimum_nights is less than equal to four (4). You could learn more about leaflet, by following the relevant Datacamp course on mapping with leaflet
leaflet(data = filter(listings_clean5)) %>%
addProviderTiles("OpenStreetMap.Mapnik") %>%
addCircleMarkers(lng = ~longitude,
lat = ~latitude,
radius = 1,
fillColor = "blue",
fillOpacity = 0.4,
popup = ~listing_url,
label = ~property_type)
The concentration of the listings are around central amsterdam and south-east Amsterdam. This is an expected result given that central and south-east amsterdam tend to the key tourist spots like the “nine streets” and thus justifies the greater concentration of listings in this areas.
For the target variable \(Y\), we will use the cost for two people to stay at an Airbnb location for four (4) nights.
Create a new variable called price_4_nights that uses price, and accomodates to calculate the total cost for two people to stay at the Airbnb property for 4 nights. This is the variable \(Y\) we want to explain.
Use histograms or density plots to examine the distributions of price_4_nights and log(price_4_nights). Which variable should you use for the regression model? Why?
Fit a regression model called model1 with the following explanatory variables: prop_type_simplified, number_of_reviews, and review_scores_rating.
review_scores_rating in terms of price_4_nights.prop_type_simplified in terms of price_4_nights.We want to determine if room_type is a significant predictor of the cost for 4 nights, given everything else in the model. Fit a regression model called model2 that includes all of the explananatory variables in model1 plus room_type.
listings_clean5 <- listings_clean5 %>%
mutate(price_4_nights = price * 4) %>% # we did not include accomodates because the given price is already including all people
filter(review_scores_rating > 0) # we exclude 0's because new listing usually don't have reviews and are therefore not informative
ggplot(listings_clean5, aes(x = price_4_nights)) +
geom_density()
ggplot(listings_clean5, aes(x = log(price_4_nights))) +
geom_density()
model1 <- lm(log(price_4_nights) ~ prop_type_simplified +
number_of_reviews +
review_scores_rating,
data = listings_clean5)
summary(model1)
Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews +
review_scores_rating, data = listings_clean5)
Residuals:
Min 1Q Median 3Q Max
-3.5924 -0.2978 -0.0340 0.2566 2.2791
Coefficients:
Estimate Std. Error t value
(Intercept) 5.8030257 0.0606399 95.696
prop_type_simplifiedEntire residential home 0.1895242 0.0179221 10.575
prop_type_simplifiedEntire townhouse 0.2700822 0.0267483 10.097
prop_type_simplifiedOther -0.0760105 0.0116508 -6.524
prop_type_simplifiedPrivate room in rental unit -0.5310931 0.0140631 -37.765
number_of_reviews -0.0005345 0.0000725 -7.372
review_scores_rating 0.1174017 0.0126436 9.285
Pr(>|t|)
(Intercept) < 2e-16 ***
prop_type_simplifiedEntire residential home < 2e-16 ***
prop_type_simplifiedEntire townhouse < 2e-16 ***
prop_type_simplifiedOther 7.14e-11 ***
prop_type_simplifiedPrivate room in rental unit < 2e-16 ***
number_of_reviews 1.80e-13 ***
review_scores_rating < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4416 on 11049 degrees of freedom
Multiple R-squared: 0.1714, Adjusted R-squared: 0.171
F-statistic: 380.9 on 6 and 11049 DF, p-value: < 2.2e-16
model2 <- lm(log(price_4_nights) ~ prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type,
data = listings_clean5)
summary(model2)
Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews +
review_scores_rating + room_type, data = listings_clean5)
Residuals:
Min 1Q Median 3Q Max
-3.5892 -0.2860 -0.0347 0.2469 2.3326
Coefficients:
Estimate Std. Error t value
(Intercept) 5.931e+00 5.876e-02 100.931
prop_type_simplifiedEntire residential home 1.914e-01 1.731e-02 11.053
prop_type_simplifiedEntire townhouse 2.728e-01 2.584e-02 10.556
prop_type_simplifiedOther 2.384e-01 1.603e-02 14.876
prop_type_simplifiedPrivate room in rental unit 1.688e-02 2.388e-02 0.707
number_of_reviews -1.873e-04 7.118e-05 -2.632
review_scores_rating 8.942e-02 1.226e-02 7.297
room_typeHotel room -2.502e-01 5.595e-02 -4.472
room_typePrivate room -5.660e-01 2.029e-02 -27.898
room_typeShared room -6.546e-01 9.434e-02 -6.939
Pr(>|t|)
(Intercept) < 2e-16 ***
prop_type_simplifiedEntire residential home < 2e-16 ***
prop_type_simplifiedEntire townhouse < 2e-16 ***
prop_type_simplifiedOther < 2e-16 ***
prop_type_simplifiedPrivate room in rental unit 0.47951
number_of_reviews 0.00851 **
review_scores_rating 3.15e-13 ***
room_typeHotel room 7.84e-06 ***
room_typePrivate room < 2e-16 ***
room_typeShared room 4.18e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4266 on 11046 degrees of freedom
Multiple R-squared: 0.2269, Adjusted R-squared: 0.2263
F-statistic: 360.3 on 9 and 11046 DF, p-value: < 2.2e-16
vif(model2)
GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 4.176895 4 1.195657
number_of_reviews 1.147273 1 1.071108
review_scores_rating 1.016980 1 1.008455
room_type 4.304831 3 1.275438
The variable (log(price_4_nights)) has been used as it effectively captures the compunding effect; it also allows for an earier and more meaningful interpretation of the regression results (in percentage terms). This is also showcased by the density plots; the variable price_4_nights is right skewed whereas the logged variable reflects a normal distribution bell curve.
Our dataset has many more variables, so here are some ideas on how you can extend your analysis
bathrooms, bedrooms, beds, or size of the house (accomodates) significant predictors of price_4_nights? Or might these be co-linear variables?(host_is_superhost) command a pricing premium, after controlling for other variables?instant_bookable == TRUE), while a non-trivial proportion don’t. After controlling for other variables, is instant_bookable a significant predictor of price_4_nights?neighbourhood, neighbourhood_cleansed, and neighbourhood_group_cleansed. There are typically more than 20 neighbourhoods in each city, and it wouldn’t make sense to include them all in your model. Use your city knowledge, or ask someone with city knowledge, and see whether you can group neighbourhoods together so the majority of listings falls in fewer (5-6 max) geographical areas. You would thus need to create a new categorical variabale neighbourhood_simplified and determine whether location is a predictor of price_4_nightsavalability_30 or reviews_per_month on price_4_nights, after we control for other variables?model3 <- lm(log(price_4_nights) ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
# room_type + GVIF higher than 5
bathrooms +
bedrooms +
beds +
accommodates +
host_is_superhost +
instant_bookable,
data = listings_clean5)
summary(model3)
Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews +
review_scores_rating + bathrooms + bedrooms + beds + accommodates +
host_is_superhost + instant_bookable, data = listings_clean5)
Residuals:
Min 1Q Median 3Q Max
-3.4023 -0.2356 -0.0085 0.2225 2.2901
Coefficients:
Estimate Std. Error t value
(Intercept) 5.103e+00 5.576e-02 91.516
prop_type_simplifiedEntire residential home -1.263e-01 1.668e-02 -7.570
prop_type_simplifiedEntire townhouse -4.733e-02 2.415e-02 -1.960
prop_type_simplifiedOther -7.893e-02 1.057e-02 -7.468
prop_type_simplifiedPrivate room in rental unit -4.165e-01 1.255e-02 -33.196
number_of_reviews -6.000e-04 6.596e-05 -9.095
review_scores_rating 1.180e-01 1.124e-02 10.497
bathrooms 1.322e-01 1.105e-02 11.957
bedrooms 1.180e-01 8.536e-03 13.826
beds 9.030e-03 5.463e-03 1.653
accommodates 1.245e-01 5.462e-03 22.798
host_is_superhostTRUE 7.001e-02 1.149e-02 6.095
instant_bookableTRUE 6.228e-03 9.236e-03 0.674
Pr(>|t|)
(Intercept) < 2e-16 ***
prop_type_simplifiedEntire residential home 4.02e-14 ***
prop_type_simplifiedEntire townhouse 0.0500 .
prop_type_simplifiedOther 8.77e-14 ***
prop_type_simplifiedPrivate room in rental unit < 2e-16 ***
number_of_reviews < 2e-16 ***
review_scores_rating < 2e-16 ***
bathrooms < 2e-16 ***
bedrooms < 2e-16 ***
beds 0.0984 .
accommodates < 2e-16 ***
host_is_superhostTRUE 1.13e-09 ***
instant_bookableTRUE 0.5002
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3832 on 11022 degrees of freedom
(21 observations deleted due to missingness)
Multiple R-squared: 0.3752, Adjusted R-squared: 0.3745
F-statistic: 551.6 on 12 and 11022 DF, p-value: < 2.2e-16
vif(model3)
GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.569514 4 1.057963
number_of_reviews 1.220562 1 1.104790
review_scores_rating 1.049764 1 1.024580
bathrooms 1.189945 1 1.090846
bedrooms 2.980894 1 1.726527
beds 2.899654 1 1.702837
accommodates 2.784538 1 1.668693
host_is_superhost 1.216296 1 1.102858
instant_bookable 1.099468 1 1.048555
listings_clean5 <- listings_clean5 %>%
mutate(neighbourhood_simplified = case_when(neighbourhood_cleansed %in% c("Oostelijk Havengebied - Indische Buurt",
"Oud-Oost",
"Watergraafsmeer",
"Noord-Oost",
"IJburg - Zeeburgereiland",
"IJburg - Zeeburgereiland",
"Gaasperdam - Driemond",
"Bijlmer-Oost")~"East",
neighbourhood_cleansed %in% c("Centrum-Oost",
"Centrum-West",
"De Pijp - Rivierenbuurt",
"Bijlmer-Centrum")~"Center",
neighbourhood_cleansed %in% c("De Baarsjes - Oud-West",
"Westerpark",
"Slotervaart",
"Bos en Lommer",
"De Aker - Nieuw Sloten",
"Osdorp",
"Geuzenveld - Slotermeer")~"West",
neighbourhood_cleansed %in% c("Zuid",
"Buitenveldert - Zuidas")~"South",
neighbourhood_cleansed %in% c("Noord-West",
"Oud-Noord")~"North"))
unique(listings_clean$neighbourhood_simplified)
NULL
model4 <- lm(log(price_4_nights) ~ neighbourhood_simplified, data = listings_clean5)
summary(model4)
Call:
lm(formula = log(price_4_nights) ~ neighbourhood_simplified,
data = listings_clean5)
Residuals:
Min 1Q Median 3Q Max
-3.6384 -0.3080 -0.0140 0.2786 2.2069
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.41097 0.00806 795.435 < 2e-16 ***
neighbourhood_simplifiedEast -0.18461 0.01335 -13.831 < 2e-16 ***
neighbourhood_simplifiedNorth -0.21179 0.02186 -9.688 < 2e-16 ***
neighbourhood_simplifiedSouth -0.08287 0.01751 -4.732 2.25e-06 ***
neighbourhood_simplifiedWest -0.18407 0.01102 -16.705 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4774 on 11051 degrees of freedom
Multiple R-squared: 0.03132, Adjusted R-squared: 0.03097
F-statistic: 89.32 on 4 and 11051 DF, p-value: < 2.2e-16
model5 <- lm(log(price_4_nights) ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
# room_type + GVIF higher than 5
bathrooms +
bedrooms +
beds +
accommodates +
host_is_superhost +
instant_bookable +
neighbourhood_simplified +
availability_30 +
reviews_per_month,
data = listings_clean5)
summary(model5)
Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews +
review_scores_rating + bathrooms + bedrooms + beds + accommodates +
host_is_superhost + instant_bookable + neighbourhood_simplified +
availability_30 + reviews_per_month, data = listings_clean5)
Residuals:
Min 1Q Median 3Q Max
-3.5230 -0.2182 -0.0018 0.2091 2.4078
Coefficients:
Estimate Std. Error t value
(Intercept) 5.188e+00 5.215e-02 99.489
prop_type_simplifiedEntire residential home -4.658e-02 1.616e-02 -2.882
prop_type_simplifiedEntire townhouse -1.639e-02 2.257e-02 -0.726
prop_type_simplifiedOther -1.315e-01 1.024e-02 -12.852
prop_type_simplifiedPrivate room in rental unit -4.428e-01 1.177e-02 -37.638
number_of_reviews -7.300e-04 6.843e-05 -10.668
review_scores_rating 1.288e-01 1.048e-02 12.291
bathrooms 1.215e-01 1.031e-02 11.784
bedrooms 1.372e-01 7.983e-03 17.193
beds 6.232e-03 5.098e-03 1.222
accommodates 1.150e-01 5.102e-03 22.535
host_is_superhostTRUE 5.013e-02 1.075e-02 4.664
instant_bookableTRUE -8.320e-03 8.669e-03 -0.960
neighbourhood_simplifiedEast -2.519e-01 1.016e-02 -24.797
neighbourhood_simplifiedNorth -3.391e-01 1.707e-02 -19.870
neighbourhood_simplifiedSouth -1.323e-01 1.318e-02 -10.036
neighbourhood_simplifiedWest -1.799e-01 8.376e-03 -21.474
availability_30 1.147e-02 4.384e-04 26.162
reviews_per_month -1.232e-02 2.258e-03 -5.456
Pr(>|t|)
(Intercept) < 2e-16 ***
prop_type_simplifiedEntire residential home 0.00396 **
prop_type_simplifiedEntire townhouse 0.46792
prop_type_simplifiedOther < 2e-16 ***
prop_type_simplifiedPrivate room in rental unit < 2e-16 ***
number_of_reviews < 2e-16 ***
review_scores_rating < 2e-16 ***
bathrooms < 2e-16 ***
bedrooms < 2e-16 ***
beds 0.22156
accommodates < 2e-16 ***
host_is_superhostTRUE 3.14e-06 ***
instant_bookableTRUE 0.33721
neighbourhood_simplifiedEast < 2e-16 ***
neighbourhood_simplifiedNorth < 2e-16 ***
neighbourhood_simplifiedSouth < 2e-16 ***
neighbourhood_simplifiedWest < 2e-16 ***
availability_30 < 2e-16 ***
reviews_per_month 4.99e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3572 on 11016 degrees of freedom
(21 observations deleted due to missingness)
Multiple R-squared: 0.4575, Adjusted R-squared: 0.4566
F-statistic: 516 on 18 and 11016 DF, p-value: < 2.2e-16
vif(model5)
GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.827265 4 1.078264
number_of_reviews 1.511856 1 1.229575
review_scores_rating 1.051581 1 1.025466
bathrooms 1.192647 1 1.092084
bedrooms 3.000618 1 1.732229
beds 2.906443 1 1.704829
accommodates 2.796295 1 1.672212
host_is_superhost 1.225863 1 1.107187
instant_bookable 1.114740 1 1.055813
neighbourhood_simplified 1.162419 4 1.018991
availability_30 1.165880 1 1.079759
reviews_per_month 1.424323 1 1.193450
autoplot(model5)
Removed variable beds due to high correlation with bedrooms and variable instant_bookable which is insignificant in model 5.
model6 <- lm(log(price_4_nights) ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
# room_type + GVIF higher than 5
bathrooms +
bedrooms +
accommodates +
host_is_superhost +
neighbourhood_simplified +
availability_30 +
reviews_per_month,
data = listings_clean5)
summary(model6)
Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews +
review_scores_rating + bathrooms + bedrooms + accommodates +
host_is_superhost + neighbourhood_simplified + availability_30 +
reviews_per_month, data = listings_clean5)
Residuals:
Min 1Q Median 3Q Max
-3.5216 -0.2174 -0.0021 0.2091 2.4143
Coefficients:
Estimate Std. Error t value
(Intercept) 5.170e+00 5.154e-02 100.311
prop_type_simplifiedEntire residential home -4.574e-02 1.614e-02 -2.834
prop_type_simplifiedEntire townhouse -1.552e-02 2.256e-02 -0.688
prop_type_simplifiedOther -1.344e-01 1.006e-02 -13.360
prop_type_simplifiedPrivate room in rental unit -4.443e-01 1.172e-02 -37.909
number_of_reviews -7.229e-04 6.839e-05 -10.571
review_scores_rating 1.315e-01 1.040e-02 12.641
bathrooms 1.218e-01 1.032e-02 11.808
bedrooms 1.413e-01 7.359e-03 19.208
accommodates 1.178e-01 4.505e-03 26.158
host_is_superhostTRUE 4.923e-02 1.072e-02 4.593
neighbourhood_simplifiedEast -2.517e-01 1.015e-02 -24.802
neighbourhood_simplifiedNorth -3.384e-01 1.707e-02 -19.819
neighbourhood_simplifiedSouth -1.315e-01 1.318e-02 -9.976
neighbourhood_simplifiedWest -1.799e-01 8.366e-03 -21.500
availability_30 1.150e-02 4.365e-04 26.333
reviews_per_month -1.248e-02 2.248e-03 -5.549
Pr(>|t|)
(Intercept) < 2e-16 ***
prop_type_simplifiedEntire residential home 0.0046 **
prop_type_simplifiedEntire townhouse 0.4913
prop_type_simplifiedOther < 2e-16 ***
prop_type_simplifiedPrivate room in rental unit < 2e-16 ***
number_of_reviews < 2e-16 ***
review_scores_rating < 2e-16 ***
bathrooms < 2e-16 ***
bedrooms < 2e-16 ***
accommodates < 2e-16 ***
host_is_superhostTRUE 4.41e-06 ***
neighbourhood_simplifiedEast < 2e-16 ***
neighbourhood_simplifiedNorth < 2e-16 ***
neighbourhood_simplifiedSouth < 2e-16 ***
neighbourhood_simplifiedWest < 2e-16 ***
availability_30 < 2e-16 ***
reviews_per_month 2.95e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3575 on 11039 degrees of freedom
Multiple R-squared: 0.4576, Adjusted R-squared: 0.4568
F-statistic: 582 on 16 and 11039 DF, p-value: < 2.2e-16
As you keep building your models, it makes sense to:
autoplot(model_x)huxreg(model1, model2, model3, model4, model5, model6)
| (1) | (2) | (3) | (4) | (5) | (6) | |
|---|---|---|---|---|---|---|
| (Intercept) | 5.803 *** | 5.931 *** | 5.103 *** | 6.411 *** | 5.188 *** | 5.170 *** |
| (0.061) | (0.059) | (0.056) | (0.008) | (0.052) | (0.052) | |
| prop_type_simplifiedEntire residential home | 0.190 *** | 0.191 *** | -0.126 *** | -0.047 ** | -0.046 ** | |
| (0.018) | (0.017) | (0.017) | (0.016) | (0.016) | ||
| prop_type_simplifiedEntire townhouse | 0.270 *** | 0.273 *** | -0.047 | -0.016 | -0.016 | |
| (0.027) | (0.026) | (0.024) | (0.023) | (0.023) | ||
| prop_type_simplifiedOther | -0.076 *** | 0.238 *** | -0.079 *** | -0.132 *** | -0.134 *** | |
| (0.012) | (0.016) | (0.011) | (0.010) | (0.010) | ||
| prop_type_simplifiedPrivate room in rental unit | -0.531 *** | 0.017 | -0.417 *** | -0.443 *** | -0.444 *** | |
| (0.014) | (0.024) | (0.013) | (0.012) | (0.012) | ||
| number_of_reviews | -0.001 *** | -0.000 ** | -0.001 *** | -0.001 *** | -0.001 *** | |
| (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | ||
| review_scores_rating | 0.117 *** | 0.089 *** | 0.118 *** | 0.129 *** | 0.131 *** | |
| (0.013) | (0.012) | (0.011) | (0.010) | (0.010) | ||
| room_typeHotel room | -0.250 *** | |||||
| (0.056) | ||||||
| room_typePrivate room | -0.566 *** | |||||
| (0.020) | ||||||
| room_typeShared room | -0.655 *** | |||||
| (0.094) | ||||||
| bathrooms | 0.132 *** | 0.122 *** | 0.122 *** | |||
| (0.011) | (0.010) | (0.010) | ||||
| bedrooms | 0.118 *** | 0.137 *** | 0.141 *** | |||
| (0.009) | (0.008) | (0.007) | ||||
| beds | 0.009 | 0.006 | ||||
| (0.005) | (0.005) | |||||
| accommodates | 0.125 *** | 0.115 *** | 0.118 *** | |||
| (0.005) | (0.005) | (0.005) | ||||
| host_is_superhostTRUE | 0.070 *** | 0.050 *** | 0.049 *** | |||
| (0.011) | (0.011) | (0.011) | ||||
| instant_bookableTRUE | 0.006 | -0.008 | ||||
| (0.009) | (0.009) | |||||
| neighbourhood_simplifiedEast | -0.185 *** | -0.252 *** | -0.252 *** | |||
| (0.013) | (0.010) | (0.010) | ||||
| neighbourhood_simplifiedNorth | -0.212 *** | -0.339 *** | -0.338 *** | |||
| (0.022) | (0.017) | (0.017) | ||||
| neighbourhood_simplifiedSouth | -0.083 *** | -0.132 *** | -0.131 *** | |||
| (0.018) | (0.013) | (0.013) | ||||
| neighbourhood_simplifiedWest | -0.184 *** | -0.180 *** | -0.180 *** | |||
| (0.011) | (0.008) | (0.008) | ||||
| availability_30 | 0.011 *** | 0.011 *** | ||||
| (0.000) | (0.000) | |||||
| reviews_per_month | -0.012 *** | -0.012 *** | ||||
| (0.002) | (0.002) | |||||
| N | 11056 | 11056 | 11035 | 11056 | 11035 | 11056 |
| R2 | 0.171 | 0.227 | 0.375 | 0.031 | 0.457 | 0.458 |
| logLik | -6647.689 | -6264.323 | -5067.142 | -7511.196 | -4288.503 | -4305.492 |
| AIC | 13311.379 | 12550.645 | 10162.285 | 15034.393 | 8617.006 | 8646.983 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | ||||||
# Prediction using model6; model 6 has by far the highest R^2 and is less prone to collinearity than model5
listings_selection <- listings_clean5 %>%
filter(prop_type_simplified == "Private room in rental unit",
number_of_reviews >= 10,
review_scores_rating / 5 >= 0.9)
set.seed(1234)
train_test_split <- initial_split(listings_selection, prop = 0.75)
listings_train <- training(train_test_split)
listings_test <- testing(train_test_split)
rmse_train <- listings_train %>%
mutate(predictions = predict(model6, .)) %>%
summarise(sqrt(sum(predictions - log(price_4_nights))**2/n())) %>%
pull()
rmse_train # 0.05610197
[1] 0.1877019
rmse_test <- listings_test %>%
mutate(predictions = predict(model6, .)) %>%
summarise(sqrt(sum(predictions - log(price_4_nights))**2/n())) %>%
pull()
rmse_test # 0.2981508
[1] 0.8266823
listings_specifications <- tibble(prop_type_simplified = "Private room in rental unit",
number_of_reviews = 10,
review_scores_rating = 0.9 * 5,
# room_type + GVIF higher than 5
bathrooms = 2,
bedrooms = 2,
accommodates = 4,
host_is_superhost = TRUE,
neighbourhood_simplified = "Center",
availability_30 = 4,
reviews_per_month = 5)
exp(predict(model6, newdata = listings_specifications, interval = "confidence"))
fit lwr upr
1 567.2637 545.6611 589.7215
# 95% CI for price for our specifications between $545.66 and $589.72 for our trip
Regression Analysis From the Analysis above we can see that model6 is the best model to use out of the 6 models. Model6 explains us the highest variance of 45.8% in Prices while the rest are much below that with Model3 explaining 37.5% of the variance , Model5 45.7%, Model2 22.7%, Model1 17.1% and Model4 only 3.1%. Furthermore, Model6 incorporates the highest number of significant variables in model making it more robust and Explanatory than any of the other models. Hence, due to the aforementioned reasons we believe that Model6 is the best Model that should be used for further analysis.
The number of bathrooms is significant at the 1% level and “accomodates” is significant at the 0.1% level. Similarly, the variable “bedroom” is also significant at the 5% level in model 5. This indicates that the independant variables do improve the regression model and thus allow for a more sound prediction of the expected price. The accomodation variable can be interpreted as - an increase in the individuals residing leads to a 14.9% rise in price. Similarly, an increase in the bedroom and bathroom leads to a 10.2% and 11.9% increase in the relevant price respectively. This is logical given that more 2 bedrooms corressponds to a larger flat/hotel and thus is likely to cost more. Controlling for “host being a super-host” also suggests that a super-host tends to charge a 5.6% higher price than a normal host, however, this coefficient is statistically insignificant. Analogous to being a super host, the variable “instant book” is also statistically insiginicant at the 5% level and hence does not appear to have a large impact on the dependant variable (price_4_nights) The 4 variables “neighbourhood simplified south, north, east and west” are streamlined to capture the effect of location on the dependant variable. As showcased in the table above, all 4 variables are statistically significant at the 0.1% level. The coefficients are all negative which suggests that, for example, staying in a listing located in the north leads to a 28.1% decline in expected price for 4 nights. This also appears to be logical as, people using airbnb in Amsterdam for 4 nights are likely to be tourists and thus will target a listing in central Amsterdam. This lack of demand for staying outside the central region indicates the negative effect of location on price. Lastly, the variable “availability_30” is also statistically significant at the 0.1% level and thus suggests that the availability of the listing does have an impact on the listings price. Surpirisngly, the impact of the variable “listings_per_month” is insignificant and thus does not appear to have a commanding impact on the listings price. This could be due to several factors - tourists may be more concerned with the convenience and the location of the listing rather than the reviews, similarly, tourists could also potentially use Amsterdam as a ‘gateway to western Europe’ and thus the reviews are unlikely to have a significant impact on the listings price
huxtable (https://mfa2022.netlify.app/example/modelling_side_by_side_tables/) that shows which models you worked on, which predictors are significant, the adjusted \(R^2\), and the Residual Standard Error.price_4_nights.price_4_nights. This should be written for an intelligent but non-technical audience. All other sections can include technical writing.Remember to follow R Markdown etiquette rules and style; don’t have the Rmd output extraneous messages or warnings, include summary tables in nice tables (use kableExtra), and remove any placeholder texts from past Rmd templates; in other words, (i.e. I don’t want to see stuff I wrote in your final report.)